Abstract

Background: This article demonstrates the application of multiple imputation (MI) for handling missing
clinical data in rheumatologic surveys, using data from 10291 people who participated in the first
phase of the Community Oriented Program for Control of Rheumatic Disorders (COPCORD) in Iran.
Methods: Five data subsets were produced from the original data set. Certain demographics were selected as complete 
variables. In each subset, we created a univariate pattern of missingness for knee osteoarthritis status as the outcome 
variable (disease) using different mechanisms and percentages. The crude disease proportion and its standard error were 
estimated separately for each complete data set to be used as true (baseline) values for percent bias calculation. The 
parameters of interest were also estimated for each incomplete data subset using two approaches to deal with missing 
data including complete case analysis (CCA) and MI with various imputation numbers. The two approaches were 
compared using appropriate analysis of variance. 

Results: With CCA, percent bias associated with missing data was 8.67 (95% CI: 7.81-9.53) for the proportion and 
13.67 (95% CI: 12.60-14.74) for the standard error. However, they were 6.42 (95% CI: 5.56-7.29) and 10.04 (95% CI: 
8.97-11.11), respectively using the MI method (M=15). Percent bias in estimating disease proportion and its standard 
error was significantly lower in missing data analysis using MI compared with CCA (P< 0.05). 

Conclusion: To estimate the prevalence of rheumatic disorders such as knee osteoarthritis, applying MI using available 
demographics is superior to CCA. 
Introduction



Missing data is an unavoidable challenge in most epidemiologic research and arises under various mechanisms (1). Missing completely at random (MCAR) refers to a condition where missingness is not related to the studied variables. Under missing at random (MAR), data are missing at random conditionally: missingness is related to other study variables but, once these are taken into account, is unrelated to the variable of interest. Missing not at random (MNAR) is the case where missingness depends on the values of the variable of interest (2-3).

In cross-sectional surveys, as in other types of observational studies, missing data can result from incomplete responses and a low rate of respondent cooperation (4-5). However, the probability of missingness is not equal for all variables; those collected by methods that are less costly and less reliant on participant cooperation are also less likely to have missing data. For example, demographics can be collected through simple approaches that depend little on subject participation, whereas clinical data such as disease status require at least taking a medical history and performing a physical exam, and in some cases can be obtained only through expensive, invasive or time-consuming diagnostic procedures, with subject consent and participation at every stage.

* Corresponding Author: Tel: +98-21 88991109, E-mail address: holakoin@tums.ac.ir

Mirmohammadkhani et al.: Multiple Imputation to Deal with Missing Clinical Data

Rheumatologic studies are no exception. A typical example is the first phase of the Community Oriented Program for Control of Rheumatic Disorders (COPCORD), performed in Tehran, the capital of Iran, in 2005 by the Rheumatology Research Center of Tehran University of Medical Sciences in collaboration with the World Health Organization (WHO) and the International League of Associations for Rheumatology (ILAR) to determine the pattern of rheumatic complaints and disorders in this region. As the first step of data gathering, trained health care providers conducted a short preliminary interview to identify eligible individuals in each randomly selected household based on their demographic characteristics. Selected participants were then approached at their homes to gather the main clinical data on their rheumatic complaints and disorders through verbal interview, and consenting participants underwent a physical exam and diagnostic tests by trained physicians and clinicians. If they were absent from home, up to two further attempts were made before they were excluded from the study. Of 13741 eligible people, 582 individuals (4.23%) refused to participate and 2868 subjects (20.87%) were not reached despite multiple attempts. Eventually, we had demographic data on all 13741 eligible people, but clinical data were missing for 3450 of them (25.11%). Among the 10291 people with clinical data, data were collected on the first or second attempt in 8401 cases (81.6%) and on the third attempt in 1890 cases (18.4%). A more detailed methodology is presented elsewhere (6-10).
The level of error due to missingness depends on the context of the study, the percentage of missing data and the missing data mechanism. Fortunately, in many situations special methods and software make it possible to reduce bias and obtain more precise findings in the presence of missing data (11). Some techniques are based on data repair, in which the missing values are imputed with appropriate values based on the observed data. In some imputation methods, each missing value is replaced with just one fitted value; these methods are classified as single imputation (SI). An important drawback of SI is that no variation is assumed for the fitted value, which leads to overestimated study precision (12). In multiple imputation (MI), by contrast, fitted values are permitted to vary. MI is popular with researchers in various fields as a novel and efficient method (13). In brief, the method has three phases. In the first phase, the missing data are imputed M times to produce M complete datasets. In the second phase, these datasets are analyzed separately using the methods of interest. In the final phase, the results of the M analyses are combined into a single inference using certain rules (14). Flexibility and efficiency are the most prominent characteristics of MI and account for its appeal (15). Although MI is increasingly used in many fields (16), the method remains quite uncommon in some fields of epidemiology despite its advantages, especially in large data sets (5, 17-19).
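
The combining step can be sketched in a few lines. The rules are cited (14) but not spelled out in this text; the sketch below assumes they are Rubin's rules, under which the pooled estimate is the mean of the M estimates and the total variance is the average within-imputation variance plus the between-imputation variance inflated by (1 + 1/M). The function name `pool_rubin` and the five input values are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine M point estimates and their squared standard errors.

    Assumes Rubin's rules; returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor M - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, sqrt(total)


# Hypothetical proportions and standard errors from M = 5 imputed datasets.
props = [0.148, 0.151, 0.150, 0.147, 0.154]
ses = [0.0080, 0.0081, 0.0080, 0.0079, 0.0082]
pooled_prop, pooled_se = pool_rubin(props, [s ** 2 for s in ses])
```

The pooled standard error is larger than any of the five individual standard errors because the between-imputation term carries the extra uncertainty due to the missing values, which is the variation that SI ignores.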

One question remains to be answered: in ongoing COPCORD studies, when the aim is to determine the prevalence of musculoskeletal disorders, can we suggest using MI as a novel imputation method instead of making inferences after excluding incomplete cases and running complete case analysis (CCA)? The objective of the present analysis was to contrast CCA and MI across three missing data mechanisms and different proportions of missing data in the context of the COPCORD study.
Materials and Methods

We used the data of 10291 participants in the first phase of the COPCORD in Iran who had complete demographic and clinical data. The presence or absence of knee osteoarthritis, an important and typical rheumatic disease, was set as the outcome variable subject to missingness. Of these participants, 1532 (14.89%) were diagnosed with knee osteoarthritis. Age, sex, marital status, education level, and occupation were treated as variables with no missing values. Using all observations of the whole data set, five independent data subsets were generated based on city zones (north, south, west, east and center) to make inferences (Table 1).
To create a univariate pattern of missing data, we deleted some entries for the outcome variable (diagnosis of knee osteoarthritis) in each data subset while retaining all observations on demographics. In this step, 10, 15, 20 and 25 percent of the disease status values were dropped from each data subset separately and independently, following three mechanisms: 1) MCAR. 2) Random deletion of entries indicating absence of disease (no knee osteoarthritis), assuming a lower participation rate among healthier people. This mechanism can be considered MNAR. 3) Random deletion of entries collected on the third (and second) attempts, reflecting the real pattern of missing data in the COPCORD study. We named this mechanism non-response. Hence 60 incomplete data subsets were generated (5 subsets × 4 percentages × 3 mechanisms).
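
As an illustration, the three deletion schemes might be coded roughly as follows. This is a minimal sketch, not the procedure actually run for the study: the record layout (`disease`, `attempt`), the toy data, and the rule of deleting all third-attempt entries before drawing from second-attempt entries are assumptions.

```python
import random


def pick_indices(records, n_delete, mechanism, rng):
    """Choose which records lose their outcome value under one mechanism."""
    everyone = list(range(len(records)))
    if mechanism == "MCAR":
        # Every record is equally likely to be deleted.
        return rng.sample(everyone, n_delete)
    if mechanism == "MNAR":
        # Only entries indicating absence of disease are eligible.
        healthy = [i for i in everyone if records[i]["disease"] == 0]
        return rng.sample(healthy, n_delete)
    if mechanism == "non-response":
        # Assumption: third-attempt entries go first, then second-attempt ones.
        third = [i for i in everyone if records[i]["attempt"] == 3]
        if n_delete <= len(third):
            return rng.sample(third, n_delete)
        second = [i for i in everyone if records[i]["attempt"] == 2]
        return third + rng.sample(second, n_delete - len(third))
    raise ValueError(f"unknown mechanism: {mechanism}")


def delete_outcomes(records, fraction, mechanism, rng):
    """Return copies of the records with `fraction` of outcomes set to None."""
    n_delete = round(fraction * len(records))
    drop = set(pick_indices(records, n_delete, mechanism, rng))
    return [
        {**r, "disease": None} if i in drop else dict(r)
        for i, r in enumerate(records)
    ]


# Toy data (hypothetical): 1000 people, 143 cases, 200 third-attempt records.
records = [
    {"disease": 1 if i % 7 == 0 else 0,
     "attempt": 3 if i % 5 == 0 else (2 if i % 5 in (1, 2) else 1)}
    for i in range(1000)
]
rng = random.Random(0)
incomplete = {
    mech: delete_outcomes(records, 0.25, mech, rng)
    for mech in ("MCAR", "MNAR", "non-response")
}
```

Only the outcome is ever set to missing; the other fields are copied unchanged, which mirrors the univariate pattern described above.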

The parameters of interest were the crude proportion of knee osteoarthritis and its standard error. In the incomplete data subsets, estimates were calculated by two approaches: CCA, which is performed after deleting cases with missing values, and MI, in which missing values were imputed M times (in our study 5, 10, 15, and 20 times) using all other observed values. We therefore had 5 complete data subsets and 60 with missing data. After dealing with missing values, we had 300 analyses, 60 of which were treated with CCA and 240 with MI, where M ranged between 5 and 20. For MI, we used Stata's ice command to perform multiple imputation by chained equations (MICE), which imputes missing values using switching regression, an iterative multivariable regression technique. Estimated parameters from each complete data subset were taken as the accepted true values (V_t). To estimate the crude disease proportion and its standard error, we used the Wald method for the binomial distribution (presence or absence of knee osteoarthritis) (Table 1).
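
For a binomial outcome, the Wald estimate is the sample proportion p = x/n with standard error sqrt(p(1 - p)/n). A minimal sketch, using the counts reported for data subset 1 in Table 1 (103 cases among 768 people); the function name is hypothetical:

```python
from math import sqrt


def wald_proportion(cases, n):
    """Return the sample proportion and its Wald standard error."""
    p = cases / n
    se = sqrt(p * (1 - p) / n)
    return p, se


# Data subset 1 in Table 1: 103 knee osteoarthritis cases among 768 people.
p, se = wald_proportion(103, 768)
```

After rounding, the result agrees with the Table 1 values for subset 1 (13.4% and 0.012).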
We calculated the percent bias associated with missing data for the disease proportion and its standard error in all 300 analyses of data subsets in which missingness was created. To measure the percent bias, we took the absolute difference between the value obtained by analysis of each incomplete data subset (V_i) and the corresponding V_t, expressed as a percentage of the true value:

$$\text{percent bias} = 100 \times \frac{|V_i - V_t|}{V_t}$$

We compared the percent bias remaining after applying MI and CCA, as well as the effect of other factors of interest, including the percentage of missing data (ranging between 10 and 25), the missing data mechanism (MCAR, MNAR and non-response), and the interactions among them. For this purpose, we used analysis of variance (ANOVA) with the general linear model (GLM) for repeated measures in SPSS version 16. This procedure provides analysis of variance when the same measurement is made several times on each subject in different periods or conditions. In our study, each of the 300 analyses was a special case of one of the five generated data subsets. All tests were done at a confidence level of 95%, and Bonferroni correction was applied in multiple comparisons.
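
A short worked example may make the percent bias definition concrete. It assumes, purely for illustration, the extreme case of the MNAR scheme applied to data subset 1 (768 people, 103 cases): 25% of the outcome values are deleted and all of them belong to people without the disease, so CCA keeps every case but loses part of the denominator. The function name is hypothetical.

```python
def percent_bias(v_i, v_t):
    """Absolute difference from the true value, as a percentage of it."""
    return 100 * abs(v_i - v_t) / v_t


n, cases = 768, 103              # data subset 1 in Table 1
v_t = cases / n                  # accepted true proportion
n_deleted = round(0.25 * n)      # 192 outcome values removed, all non-cases
v_cca = cases / (n - n_deleted)  # complete case estimate of the proportion
bias = percent_bias(v_cca, v_t)
```

Here the complete case proportion equals the true proportion divided by 0.75, so the percent bias is one third of the true value (about 33.3%) whatever the prevalence.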

Results 

In our analysis, the grand mean of percent bias associated with missing data was 6.94 (95% CI: 6.55-7.33) for the disease proportion and 10.99 (95% CI: 10.51-11.46) for the standard error. Figure 1 shows the mean percent bias for the proportion (A and C) and standard error (B and D) with different missing data mechanisms (A and B) and missing data handling methods (C and D) separately for each data subset.
The effects of the missing data handling method and the other factors on percent bias, as tested by ANOVA, are summarized in Table 2. The percent bias in estimating the disease proportion differed significantly between the data subsets (P<0.05), but the percent bias in estimating its standard error did not (P>0.2). In estimating the proportion, the missing data mechanism (P=0.01) and missing data percentage (P<0.001) as well as their interaction (P<0.001) had significant effects on bias. In estimating the standard error, bias was significantly affected by the missing data percentage (P<0.001) and its interaction with the missing data mechanism (P<0.001), but not by the missing data mechanism per se (P=0.07). The missing data handling method significantly affected the bias in estimating both the proportion (P=0.02) and its standard error (P<0.001). Comparing the values of partial eta squared revealed that, in estimating the proportion, the interaction of missing data percentage and mechanism had the greatest effect on percent bias, whereas in estimating the standard error, the missing data percentage was the most influential factor.

Table 3 presents the percent bias and its confidence intervals for each missing data mechanism and for each handling method separately. The highest value pertained to the MNAR mechanism (16.56, 95% CI: 15.89-17.22 for proportion estimation and 14.42, 95% CI: 13.59-15.25 for standard error estimation) and the smallest value to MCAR (1.99, 95% CI: 1.32-2.66 for proportion estimation and 8.96, 95% CI: 8.13-9.79 for standard error estimation). The value for non-response lay between the two (2.28, 95% CI: 1.61-2.95 for proportion estimation and 9.58, 95% CI: 8.75-10.41 for standard error estimation). As shown in Table 3, using CCA rather than MI to deal with missing data resulted in greater percent bias. The highest percent bias values were with CCA (8.67, 95% CI: 7.81-9.53 for proportion estimation and 13.67, 95% CI: 12.60-14.74 for standard error estimation) and the smallest pertained to MI with M=15 (6.42, 95% CI: 5.56-7.29 for proportion estimation and 10.04, 95% CI: 8.97-11.11 for standard error estimation).

Table 4 presents the mean difference between CCA and MI with different imputation numbers (M) in terms of percent bias in estimating the parameters of interest; differences between the two approaches were statistically significant (P<0.05) for every M in estimating both the proportion and the standard error. However, no significant differences were found between the various imputation numbers. Table 5 shows the results of pairwise comparisons of percent bias for the parameters of interest between the missing data mechanisms. Significant differences were found between the MNAR and MCAR mechanisms in estimating both the proportion (P<0.001) and the standard error (P<0.001). In addition, there was a significant difference between the MNAR mechanism and missing data due to non-response in estimating both parameters of interest (P<0.001), but not between MCAR and non-response (P>0.9).



Table 1: Population, number of knee osteoarthritis cases and accepted true values of the parameters of interest (proportion and standard error) in each generated data subset

                                            True values (V_t)
Data subset   Population (No. of cases)     Proportion (%)   Standard error
1             768 (103)                     13.4             0.012
2             2965 (469)                    15.8             0.007
3             1809 (271)                    15.0             0.008
4             2911 (445)                    15.3             0.007
5             1838 (244)                    13.3             0.008
Total         10291 (1532)                  14.9             0.003
Table 2: Results of ANOVA to test the effects on the level of percent bias in estimating the parameters of interest (proportion and standard error)

                                 Proportion                      Standard error
Source of effects (a)            P         Partial eta squared   P         Partial eta squared
Data subset                      <0.001    0.218                 0.3       0.019
Data subset*Mechanism            0.001     0.243                 0.3       0.046
Data subset*Method               0.007     0.239                 0.3       0.088
Data subset*Percent              0.004     0.152                 0.3       0.017
Data subset*Mechanism*Percent    0.003     0.212                 0.3       0.050
Mechanism                        0.01      0.164                 0.07      0.097
Method                           0.002     0.288                 <0.001    0.397
Percent                          <0.001    0.758                 <0.001    0.869
Mechanism*Percent                <0.001    0.809                 <0.001    0.378
Intercept                        0.3       0.017                 0.002     0.178

(a) Lower-bound epsilon is used for adjustment to the numerator and denominator degrees of freedom in order to validate the univariate F statistic










Table 3: Percent bias and its 95% confidence interval in estimating the parameters of interest (proportion and standard error) regarding different missing data mechanisms and handling methods

                                                        95% Confidence Interval
Parameter        Method / Mechanism    Percent bias     Lower bound    Upper bound
Proportion       MI (a) (M=5)          6.515            5.651          7.379
                 MI (M=10)             6.549            5.685          7.413
                 MI (M=15)             6.424            5.561          7.288
                 MI (M=20)             6.559            5.695          7.423
                 CCA (b)               8.670            7.807          9.534
                 Non-response          2.283            1.614          2.952
                 MNAR (c)              16.556           15.886         17.225
                 MCAR (d)              1.992            1.323          2.661
Standard error   MI (M=5)              10.338           9.268          11.408
                 MI (M=10)             10.762           9.692          11.831
                 MI (M=15)             10.042           8.972          11.112
                 MI (M=20)             10.114           9.045          11.184
                 CCA                   13.673           12.603         14.743
                 Non-response          9.578            8.750          10.407
                 MNAR                  14.420           13.591         15.249
                 MCAR                  8.959            8.131          9.788

(a) Multiple imputation, (b) Complete case analysis, (c) Missing not at random, (d) Missing completely at random
Table 4: Mean difference in percent bias between CCA and MI with different imputation numbers (M) in estimating the parameters of interest (proportion and standard error)

                                                             95% Confidence Interval
Parameter        M     Difference in percent bias   P (a)     Lower bound    Upper bound
Proportion       5     2.16                         0.009     0.37           3.94
                 10    2.12                         0.010     0.33           3.91
                 15    2.25                         0.005     0.46           4.03
                 20    2.11                         0.01      0.32           3.90
Standard error   5     3.33                         0.001     1.12           5.55
                 10    2.91                         0.003     0.70           5.12
                 15    3.63                         <0.001    1.42           5.84
                 20    3.56                         <0.001    1.35           5.77

(a) Bonferroni adjustment














Table 5: Pairwise comparison of estimation percent bias for parameters of interest (proportion and standard error) as calculated with different missing data mechanisms

                                                                                    95% Confidence Interval
Parameter        I              J           Difference in percent bias (I-J)   P (a)     Lower bound    Upper bound
Proportion       Non-response   MNAR (b)    -14.27                             <0.001    -15.44         -13.11
                 Non-response   MCAR (c)    0.29                               1.00      -0.88          1.46
                 MNAR           MCAR        14.56                              <0.001    13.40          15.73
Standard error   Non-response   MNAR        -4.84                              <0.001    -6.29          -3.40
                 Non-response   MCAR        0.62                               0.9       -0.83          2.06
                 MNAR           MCAR        5.46                               <0.001    4.01           6.91

(a) Bonferroni adjustment, (b) Missing not at random, (c) Missing completely at random










Fig. 1: Percent bias for the proportion (A and C) and standard error (B and D) with different missing data mechanisms (A and B) and missing data handling methods (C and D) separately for each data subset. (MNAR=Missing not at random, MCAR=Missing completely at random, MI=Multiple Imputation, CCA=Complete case analysis)
Discussion



The main goal of research and statistical analysis is to obtain accurate and valid estimates about a population of interest. However, missing data often occur and may adversely affect study results (20). Even if missing data can be assumed MCAR, ignoring them and running CCA can lead to loss of information, reduced sample size and diminished study precision (3). Subjects' unwillingness or lack of cooperation depends on numerous and complex causes, which may be associated with the study variables directly or indirectly. For example, in many situations healthier individuals may be less inclined to participate or cooperate in research. Therefore, the MCAR assumption is usually violated, and CCA not only reduces study precision but may also lead to biased results. With MI, the acceptable missing data mechanisms are MCAR or MAR (the MAR assumption); otherwise it should be used and interpreted with caution (14). Since the MAR assumption is not testable with observed data, it is potentially useful to conduct a perturbation (sensitivity) analysis of the consequences of departure from the MAR assumption. The results of the current analyses revealed that missing clinical data (diagnosis of knee osteoarthritis) can create bias in estimating both the crude proportion of disease and its standard error. The COPCORD dataset had about 25% missing data; of the selected samples, 4.2% were unwilling to participate and 20.8% were unreachable despite 3 attempts. In the worst-case scenario, if MNAR has to be assumed, both CCA and MI can threaten study results by introducing serious bias. According to our findings, where missing data were mainly due to unreachability, the level of percent bias was not significantly different from that with MCAR. This finding suggests that missing clinical data due to unreachability should not be considered MNAR, and thus applying MI is appropriate. However, the case is more complicated for those who refused to participate in the first place. Some reports suggest MI can provide acceptable results even when the missing data mechanism is MNAR, although its application requires expertise and caution (12, 20-24). Compared to CCA, our study results indicated significantly less percent bias with MI in all situations, in agreement with the abovementioned reports. In light of this observation, as a practical approach in settings similar to the global COPCORD initiative, in which the objective is to determine the prevalence of musculoskeletal disorders such as osteoarthritis, we suggest making use of demographics such as age, gender, education, and occupation (which can be collected more conveniently) to determine the disease status, and applying MI rather than eliminating cases whose disease status is missing.
It must be noted that applying MI may be questionable when its assumptions are violated, but this does not justify the use of CCA either. In other words, despite being more time consuming and complicated, MI seems to be a better choice than CCA even when the data are MNAR and biased estimates are expected. The aim of MI is not to make up data; on the contrary, it is to make use of all observations to repair the database as far as possible. The more valid and appropriate the imputation and estimation model, the smaller the bias. In our study, the intrinsic correlation between knee osteoarthritis, as a typical musculoskeletal disorder, and the demographics is a logical explanation for the ability of MI to reduce percent bias significantly, and supports its application.
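
The role of demographic predictors can be illustrated with a toy simulation. The sketch below is not the chained-equations procedure of Stata's ice command; it swaps in a much simpler technique, drawing each age stratum's prevalence from a Beta posterior and then imputing missing outcomes as Bernoulli draws, and it pools the results assuming Rubin's rules. The two age groups, their prevalences (5% and 30%) and missingness rates (40% and 5%), the seed, and all function names are assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import mean, variance


def make_records(rng):
    """Toy population: prevalence and missingness both depend on age group."""
    spec = {"young": (6000, 0.05, 0.40), "old": (4000, 0.30, 0.05)}
    records = []
    for group, (n, prevalence, p_missing) in spec.items():
        n_cases = round(n * prevalence)
        for i in range(n):
            disease = 1 if i < n_cases else 0
            # The chance of a missing outcome depends on age group only.
            observed = None if rng.random() < p_missing else disease
            records.append(
                {"age_group": group, "disease": disease, "observed": observed}
            )
    return records


def impute_once(records, rng):
    """Fill missing outcomes once, using draws made within each age group."""
    drawn = {}
    for group in sorted({r["age_group"] for r in records}):
        seen = [r["observed"] for r in records
                if r["age_group"] == group and r["observed"] is not None]
        cases = sum(seen)
        # Beta posterior under a uniform prior, so that uncertainty about
        # the stratum prevalence carries over into the imputed values.
        drawn[group] = rng.betavariate(1 + cases, 1 + len(seen) - cases)
    return [
        r["observed"] if r["observed"] is not None
        else int(rng.random() < drawn[r["age_group"]])
        for r in records
    ]


def mi_proportion(records, m, rng):
    """Impute m times, estimate the proportion each time, then pool."""
    estimates, variances = [], []
    for _ in range(m):
        filled = impute_once(records, rng)
        p = mean(filled)
        estimates.append(p)
        variances.append(p * (1 - p) / len(filled))
    total = mean(variances) + (1 + 1 / m) * variance(estimates)
    return mean(estimates), sqrt(total)


rng = random.Random(2024)
records = make_records(rng)
truth = mean(r["disease"] for r in records)
cca = mean(r["observed"] for r in records if r["observed"] is not None)
mi, mi_se = mi_proportion(records, m=5, rng=rng)
bias_cca = 100 * abs(cca - truth) / truth
bias_mi = 100 * abs(mi - truth) / truth
```

In this setup the older group is both more often diseased and more often observed, so CCA overstates the prevalence, while imputing within age groups brings the estimate back toward the true 15%.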

Based on the rules for MI, a minimum of 3 imputations is necessary before any interpretation of results, and theoretically there is no upper limit. In practice, most studies have used an M of 5 or 10 (25). In our study, the level of percent bias did not decrease significantly when we changed the number of imputations from 5 to 10, 15, or 20.
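
One commonly cited explanation for the small gain from larger M is Rubin's relative efficiency of an estimate based on M imputations compared with infinitely many, 1 / (1 + γ/M), where γ is the fraction of missing information. The sketch below evaluates it for the values of M used here; taking γ = 0.25 (the largest missing data percentage in this study, used as a stand-in for the fraction of missing information) is an assumption, and the function name is hypothetical.

```python
def relative_efficiency(gamma, m):
    """Efficiency (on the variance scale) of m imputations versus infinitely many."""
    return 1 / (1 + gamma / m)


# gamma = 0.25 is an assumed fraction of missing information.
efficiency = {m: relative_efficiency(0.25, m) for m in (5, 10, 15, 20)}
```

Under this assumption the efficiency rises only from about 0.95 at M=5 to about 0.99 at M=20.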
MI is probably one of the most accurate and useful methods for handling missing data; nonetheless, it is not necessarily the best. In this study we did not assess other missing data analysis methods, and thus we cannot recommend MI as the best approach for dealing with missing data in similar scenarios. However, MI is acceptably efficient and flexible, and considering the availability of user-friendly statistical software, it seems logical to use MI instead of the traditional CCA. We suggest comparing MI with other methods in this setting.

We used a single clinical variable to avoid unwanted complexity, which limits the generalizability of the study results. To minimize this effect, we chose a typical rheumatic disorder, i.e. knee osteoarthritis, which can properly represent many musculoskeletal disorders in accordance with the analytic objectives of the study. We suggest conducting similar assessments with other variables in this setting.

In summary, in a study setting similar to ours, neglecting missing clinical data can be a source of significant bias in estimating the prevalence of musculoskeletal diseases such as knee osteoarthritis. A suitable option to reduce the impact of this issue and increase the accuracy of estimates is to use MI to repair missing clinical data based on demographics, which is a better choice than CCA.